[sled-agent] Give InstanceManager a gun #6915
This is presently a draft, partly because it depends on #6911, but more so because I'm wondering whether it's actually the complete solution. Potentially, we might want to make a best-effort attempt to finish processing other requests, like creating a zone bundle, with a timeout so that the terminate request is always honored eventually. I'll keep working on this.
},
// Requests to terminate the instance take priority over any
Maybe a silly question, but: What happens if this task is already stuck awaiting a response to one of the commands below? IIUC that's what we're seeing in #6911: the InstanceRunner loop for some instance is stuck waiting on a Propolis request, so it's not looking at any of its message queues.
Ah, yeah, you're right, this will also need to select over both the operation we do in each of these branches and the termination channel firing. I'll fix that.
Okay, @gjcolombo, commit 2a331ab changes this so that the termination channel firing takes priority over any in-flight request to the instance, so even if that request gets stuck, we will still pull the plug immediately.
I wondered if we wanted to add a grace period to allow the in-flight operation to finish, but after talking to @smklein, I don't think that's actually necessary, because terminate is only used by the most forceful attempts to stop an instance (the vmm_unregister API and killing an instance that was using an expunged disk). A "normal" attempt to stop the instance should go through instance_put_state with the Stopped state, which goes in the normal request queue.
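For readers following along, here is a rough, std-only sketch of the idea discussed above: an in-flight operation that may never complete is raced against a termination signal, and whichever arrives first wins. It is not the sled-agent code (which is async); `Wake`, `Outcome`, and `wait_for_op_or_terminate` are hypothetical names made up for this illustration.

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::thread;

/// Hypothetical result of waiting on one in-flight operation.
#[derive(Debug, PartialEq)]
enum Outcome {
    Completed(&'static str),
    Terminated,
}

/// Hypothetical wake-up message: either the operation finished or a
/// termination request fired.
enum Wake {
    OpDone(&'static str),
    Terminate,
}

/// Runs `op` on a worker thread and returns as soon as EITHER the operation
/// reports completion OR a `Wake::Terminate` arrives on the same channel.
/// A stuck `op` therefore cannot prevent termination from being observed.
fn wait_for_op_or_terminate<F>(
    op: F,
    wake_tx: Sender<Wake>,
    wake_rx: Receiver<Wake>,
) -> Outcome
where
    F: FnOnce() -> &'static str + Send + 'static,
{
    thread::spawn(move || {
        let result = op();
        // The receiver may already be gone if we were terminated; ignore that.
        let _ = wake_tx.send(Wake::OpDone(result));
    });

    match wake_rx.recv() {
        Ok(Wake::OpDone(result)) => Outcome::Completed(result),
        // Treat a closed channel like termination in this sketch.
        Ok(Wake::Terminate) | Err(_) => Outcome::Terminated,
    }
}
```

To use it, a caller would clone the `Sender<Wake>` before passing it in, keep the clone as the termination handle, and send `Wake::Terminate` on it. One caveat of this thread-based sketch is that a stuck worker thread is abandoned rather than cancelled; in async Rust, dropping the losing future is what cancels the in-flight operation.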
7be3c00 to 2a331ab
2a331ab to da53ce9
as suggested by @smklein
Stacked on top of #6913
Presently, sled-agent sends requests to terminate an instance to the InstanceRunner task over the same request channel as all other requests sent to that instance. This means that the InstanceRunner will attempt to terminate the instance only once other requests received before the termination request have been processed, and an instance cannot be terminated if its request channel has filled up.

This seems unfortunate. If an instance gets stuck, the fact that it's stuck should not prevent it from being stopped. Instead, requests to terminate the instance should be prioritized over other requests. This commit does that.
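As a rough illustration of the prioritization described above (not the actual sled-agent implementation, which is async), the following std-only sketch gives termination its own channel and checks it before the ordinary request queue on every loop iteration. `Event` and `run_instance_loop` are hypothetical names, and requests are reduced to plain `u32` IDs.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical record of what the runner loop did.
#[derive(Debug, PartialEq)]
enum Event {
    Handled(u32),
    Terminated,
}

/// Hypothetical runner loop. Termination requests arrive on a dedicated
/// channel that is polled first, so they are honored even when ordinary
/// requests are queued ahead of them or the request channel is full.
fn run_instance_loop(requests: Receiver<u32>, terminate: Receiver<()>) -> Vec<Event> {
    let mut log = Vec::new();
    loop {
        // Priority check: a pending termination request always wins.
        if terminate.try_recv().is_ok() {
            log.push(Event::Terminated);
            return log;
        }

        // Otherwise handle at most one ordinary request, waiting only
        // briefly so the termination channel is re-checked regularly.
        match requests.recv_timeout(Duration::from_millis(10)) {
            Ok(id) => log.push(Event::Handled(id)),
            Err(RecvTimeoutError::Timeout) => continue,
            Err(RecvTimeoutError::Disconnected) => return log,
        }
    }
}
```

With a bounded request channel created by `std::sync::mpsc::sync_channel`, a sender can fill the request queue and still deliver a termination request on the second channel; the loop then terminates without touching the queued requests. The short polling timeout is a simplification of this sketch; an async runtime would wait on both channels at once instead.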